Linear regression, covered in the previous chapter, is often seen as the entry point to the world of supervised machine learning. However, not every phenomenon can be explained using a linear relationship, and not everything is a regression. For the former, you need methods with a bit more complicated maths behind them (but often the same Python API). For the latter, you will often need to look for classification models. Both of these options are covered in this course, with non-linear regression models in this session and classification in the next one. Today, you’ll focus on learning the API and the key principles of supervised machine learning with scikit-learn, and especially on spatial evaluation of model performance. To a degree, it is a continuation of the work covered last time, but there are some new things here and there.
This is not an introduction to ML
Note that this material does not aim to be a thorough introduction to machine learning. There are other, much better materials for that. One of them is scikit-learn’s User Guide, but I am sure you will find one that suits you.
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from libpysal import graph

import esda
import pyinterpolate
from sklearn import ensemble, svm, metrics, model_selection
Data
Let’s stick to a similar type of data you were working with in the previous chapter on linear regression. The data you will be working with today once again represents municipalities in Czechia, but this time your task will not be the prediction of election results but the rate of executions (of any kind, not only those related to mortgages). The dataset represents the proportion of the population aged 15 and above in each municipality with at least one court-mandated execution during Q3 2024, coming from mapazadluzeni.cz by PAQ Research. The CSV is only slightly pre-processed to contain only relevant information.
Instead of reading the file directly off the web, it is possible to download it manually, store it on your computer, and read it locally. To do that, you can follow these steps:
Download the file by right-clicking on this link and saving the file
Place the file in the same folder as the notebook where you intend to read it
You have your target variable. The models will try to predict it based on a set of independent variables, the first of which is the proportion of votes for Andrej Babiš, a candidate often marked as a populist or as belonging to the anti-system movement (although not as strongly as some other parties in the country). You can read this data from the chapter on autocorrelation. The dataset also provides the geometries of individual municipalities.
The second set of independent variables reflects the education profile. It is the same dataset you used last week; this time, though, you will use only a subset.
Given that each dataset is now in its own (Geo)DataFrame, the first thing is to merge them together to have a single GeoDataFrame to work with throughout the notebook. Each of them has a column representing the code of each municipality, allowing us to easily merge the data together.
You may notice that some of the columns are in Czech. The important one is "podil_osob_v_exekuci", which stands for the proportion of people with an execution. Check what the spatial distribution looks like to get a sense of the data.
Specify the minimum and maximum for the colormap based on the values used by PAQ Research. It helps filter out outliers.
Proportion of population with executions
The executions_data GeoDataFrame obviously has more columns than needed. For modelling purposes, select a subset of variables to treat as independent, explanatory: the election results of Andrej Babiš, three columns representing education levels, mean age, and the divorce rate.
Make the colorbar half the size for a better looking plot.
Spatial distribution of independent variables
Machine learning 101
As mentioned above, this material is not meant to provide an exhaustive introduction to machine learning. Yet, some basics shall be explained. Start by storing the independent variables and the target variable separately.
Some data are used to train the model, but the same data cannot be used for evaluation. Models tend to learn those exact values, and performance metrics derived from training data show a much higher score than the model can achieve on unseen data. One way around this is to split the dataset into two parts - train and test. The train part is used to train the model, while the test part is left out of training and is later used to evaluate performance without worrying that any of the data points were seen by the model before.
scikit-learn offers a handy function to split the data into train and test parts, dealing with both dependent and independent variables at the same time.
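A minimal sketch of the split on synthetic stand-in arrays (the real notebook uses the municipality data instead):

```python
import numpy as np
from sklearn import model_selection

# Synthetic stand-ins for the real X (independent) and y (dependent) arrays
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = rng.normal(size=100)

# Hold out 20% of the rows for evaluation; random_state makes the split reproducible
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (80, 6) (20, 6)
```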
X and y are not very explanatory variable names but they are used in ML so often to reflect independent (X) and dependent (y) variables that everyone knows what they mean. You’ll see the same naming across scikit-learn’s documentation.
      AndrejBabis  undetermined  masters_degree  primary_education   mean_age  divorced
4126        34.31      4.103842       12.849089          10.607054  46.251550  0.061366
6100        26.66     11.016949       10.593220           6.991525  40.045872  0.073394
5222        60.52      3.061224        4.081633          13.265306  43.833333  0.045833
1788        44.15      4.856512        9.602649          13.355408  43.606174  0.050514
546         39.88      5.284435       10.511118          11.059775  43.974484  0.056744
You can check that X_*, containing the independent variables, is split into two parts, one with 80% of the data and the other with the remaining 20%, completely at random.
X_train.shape, X_test.shape
((5003, 6), (1251, 6))
Training
While there is a large number of ML models available, your goal at the moment is not to understand which ML model is better and how to fine-tune it, but how to evaluate it using the spatial dimension. So, let’s not complicate the situation and stick to one of the most common models - random forest.
The random forest regressor is implemented within the ensemble module of scikit-learn and has the API you should already be familiar with. Get the training data and fit the baseline model.
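A minimal sketch of the fitting step on synthetic data (standing in for the X_train and y_train from the split above):

```python
import numpy as np
from sklearn import ensemble

# Synthetic training data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 6))
y_train = rng.normal(size=80)

# n_jobs=-1 uses all available cores; random_state=0 makes the forest reproducible
model = ensemble.RandomForestRegressor(n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
```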
RandomForestRegressor(n_jobs=-1, random_state=0)
n_jobs=-1 specifies that the algorithm should use all available cores. Otherwise, it runs in a single thread only. There are also some model parameters to play with but more on that below.
The first argument is a 2-D array of independent variables, and the second is a 1-D array of labels you want to predict.
Predicting
The trained model can be directly used to predict the value, the proportion of executions. Normally, you make the prediction using the unseen portion of the data from the train-test split.
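Sketched on the same kind of synthetic data, prediction is a single method call:

```python
import numpy as np
from sklearn import ensemble

# Synthetic stand-ins for the train-test split
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(80, 6)), rng.normal(size=80)
X_test = rng.normal(size=(20, 6))

model = ensemble.RandomForestRegressor(n_jobs=-1, random_state=0).fit(X_train, y_train)

# One predicted value per row of X_test, in the same order
pred_test = model.predict(X_test)
print(pred_test.shape)  # (20,)
```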
The output is a simple numpy array aligned with the values from X_test. What this array is mostly used for is the model evaluation.
Evaluation
Evaluation is usually composed of a series of measurements comparing the expected and predicted values and assessing how close the result is. Regression problems typically use \(R^2\), mean absolute error, or root mean squared error, among others. Let’s stick to these three for now. All are directly implemented in scikit-learn.
R-squared
\(R^2\) is a measure you should already be familiar with from the previous session on linear regression. It expresses the proportion of variation of the target variable that could be predicted from the independent variables.
r2 = metrics.r2_score(y_test, pred_test)
r2
0.5590843309809401
This baseline model is able to explain about 56% of variation. Not bad, given the limited data and no fine-tuning.
Mean absolute error
The name kind of says it all. The mean absolute error (MAE) reports how far, on average, the predicted value is from the expected one, directly in the units of the original target variable, making it easy to interpret.
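On toy values (hypothetical, for illustration only), the averaging is easy to follow:

```python
import numpy as np
from sklearn import metrics

# Toy expected and predicted values
y_test = np.array([4.0, 5.0, 6.0])
pred_test = np.array([3.5, 5.5, 7.0])

# Mean of absolute differences: (0.5 + 0.5 + 1.0) / 3
mae = metrics.mean_absolute_error(y_test, pred_test)
print(mae)
```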
Here, you can observe that the error is about 1.6% on average. But given that the average rate of executions is 4.95%, that is quite a bit of spread.
Root mean squared error
Root mean squared error (RMSE) is a very similar metric, but it is more sensitive to larger errors. Therefore, if a small proportion of observations is very far off, RMSE will penalise the resulting performance metric more than MAE.
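The squaring step is what gives large errors extra weight, as the same toy values show:

```python
import numpy as np
from sklearn import metrics

y_test = np.array([4.0, 5.0, 6.0])
pred_test = np.array([3.5, 5.5, 7.0])

# Square the errors first, so the 1.0 error weighs more than the two 0.5 errors.
# Recent scikit-learn versions also offer metrics.root_mean_squared_error directly.
rmse = metrics.mean_squared_error(y_test, pred_test) ** 0.5
print(rmse)  # sqrt((0.25 + 0.25 + 1.0) / 3)
```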
It is a bit larger than MAE in this case, meaning that there are outliers with exceptionally high residuals. You’ll be looking at multiple models and evaluations, so let’s start building a summary allowing simple reporting and comparison.
Now, if you want to plot the predicted labels on a map, you can do that reliably only for the test sample. The training sample was seen by the model and would not be representative of model capabilities.
Working with this would be a bit tricky if we want to look into the spatial dimension of the model error.
However, you can create a map using the complete sample, just not using exactly the same model for all its parts. Welcome cross-validated prediction.
Cross-validated (CV) prediction splits the dataset (before you divided it into train and test) into a small number of parts and trains a separate model to predict values for each of them. In the example below, it creates five equally-sized parts and then takes four of them as the train part to train a model that is used to predict values on the fifth one. Then, it switches the part that is left out and repeats the process until there are predicted values for every part. The resulting prediction should not contain any data leakage between train and test samples. However, as described below, that is not always the case when dealing with spatial data.
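The procedure described above can be sketched with scikit-learn's cross_val_predict on synthetic data (a small forest keeps it fast):

```python
import numpy as np
from sklearn import ensemble, model_selection

# Synthetic full dataset (before any train-test split)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = rng.normal(size=100)

model = ensemble.RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=0)

# cv=5: five folds, each predicted by a model trained on the other four
pred_cross_val = model_selection.cross_val_predict(model, X, y, cv=5)
print(pred_cross_val.shape)  # one out-of-fold prediction per observation
```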
Cross-validation also allows you to assess the quality of the model more reliably, minimising the effect of sampling on the metric. You can simply measure the performance on the full array, taking every one of the five folds into account.
These results are not wildly off, but the performance dropped a bit. The smaller the dataset (and this one is pretty small), the higher the effect of the train-test split can be. Let’s refer to the cross-validated metrics as the more reliable representation of the performance of the baseline model here.
Residuals
Another positive aspect of cross-validation is that it allows us to retrieve a full sample of residuals. Unlike in linear regression, where residuals are part of the model, here you have to compute them yourself as the difference between the expected and the predicted value.
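The computation itself is a single subtraction, shown here on toy values (hypothetical, for illustration only):

```python
import numpy as np

# Toy expected values and cross-validated predictions
y = np.array([4.0, 5.0, 6.0])
pred_cross_val = np.array([4.5, 4.8, 6.0])

# Positive residual = under-prediction; negative = over-prediction
residuals = y - pred_cross_val
print(residuals)  # [-0.5  0.2  0. ]
```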
Negative values mean the model has over-predicted the value, while positive ones mean under-prediction. The optimum is zero. Check the residuals on a map.
Get the standard deviation of the absolute value of residuals to help reasonably stretch the colormap.
Get the minimum and maximum of the colormap as 5 standard deviations below and above zero.
Spatial distribution of residuals
The spatial pattern of the error signifies issues; you should know that from the last session. You can optionally use some of the strategies covered there to mitigate it. Today, let’s look at more advanced spatial evaluation of the model performance.
Spatial evaluation
The metrics reported above are global: a single value per model. But the map indicates that performance varies in space. Let’s see how.
Spatially stratified metrics
Global metrics can be computed on regional subsets. There is information about the okres (county) of each municipality, so you can try computing the metrics for each individual okres. To make it all a bit easier, assign the cross-validated prediction as a new column.
executions_data["prediction"] = pred_cross_val
Using groupby, you can group the data by "okres" and check the metric within each one. It is better to use metrics derived from real values than \(R^2\), as the latter is not well comparable across different datasets (each okres would be its own dataset in this logic).
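A minimal sketch of the idea on a hypothetical frame mimicking executions_data (column names borrowed from the real data, values invented):

```python
import pandas as pd
from sklearn import metrics

# Hypothetical frame with a regional "okres" column
df = pd.DataFrame({
    "okres": ["A", "A", "B", "B"],
    "podil_osob_v_exekuci": [4.0, 5.0, 6.0, 8.0],
    "prediction": [4.5, 5.5, 6.5, 7.0],
})

# One MAE value per okres
mae_by_okres = df.groupby("okres").apply(
    lambda group: metrics.mean_absolute_error(
        group["podil_osob_v_exekuci"], group["prediction"]
    )
)
print(mae_by_okres)
```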
Then merge with the original data. The spatial metrics cannot simply be assigned as new columns, as they are much shorter - only one value per okres. You need to merge on the okres values to assign them as new columns.
The spatial variation is evident. Also evident are the boundaries between individual okresy, suggesting a MAUP issue. At the same time, such an aggregation may not always be available.
A better option is to measure the spatial metrics using the Graph. You can define neighborhoods and measure the metric in each neighborhood individually, reporting a unique value per focal geometry. In this case, you can assume that daily mobility is not limited to neighbouring municipalities only, so let’s get the K-nearest neighbors with 100 neighbors (the median number of municipalities in an okres is 79, so it should cover a roughly similar scale). Using very small neighborhoods may result in the metrics jumping up and down erratically due to sampling issues.
Set geometry to centroids (on-the-fly) as the KNN constructor requires point geometries.
Assign the self-weight to 1, effectively including the focal geometry in its own neighbourhood.
Graph.apply works as an overlapping groupby.apply. You just need to give it the actual data as the first argument. Remember that the graph contains only the information about the relationship.
You can map the results again, observing much smoother transitions between low and high values, minimising the boundary aspect of MAUP (the scale aspect is still present).
MAE and RMSE measured within 100 nearest neighbors. Notice the more visible artifacts of large errors in the RMSE map caused by the penalisation.
With these data, you can do any of the spatial analysis you are used to, like extracting local clusters of low or high performance or fixing the model to avoid these artifacts.
Spatial dependency of error
Let’s now focus on a direct spatial assessment of residuals. The map of residuals above indicates that there is some spatial structure to unpack, so let’s dive into the assessment of the spatial dependency of the model error.
Variogram
Let’s check the spread of the assumed autocorrelation. Is the dependency relevant only locally, regionally or nationally? To answer this question, you can build an experimental variogram, like you did when dealing with kriging. The variogram should then indicate the extent of autocorrelation.
Remember that pyinterpolate assumes data in a specific structure, so let’s quickly prepare it. For details, check the interpolation chapter.
The max range is 490 km, which is roughly the width of the dataset.
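To make the idea concrete, the quantity an experimental variogram measures can be sketched by hand with numpy on synthetic coordinates and values (pyinterpolate automates the binning and later fits theoretical models on top):

```python
import numpy as np

# Synthetic residuals at random coordinates (illustration only)
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(200, 2))
values = rng.normal(size=200)

# All pairwise distances and squared value differences
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
sq_diff = (values[:, None] - values[None, :]) ** 2

# Semivariance per distance bin (lag): half the mean squared difference
# of all pairs whose distance falls within the bin
bin_edges = np.arange(0, 101, 10)
semivariance = np.array([
    0.5 * sq_diff[(d > lo) & (d <= hi)].mean()
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
])
print(semivariance.shape)  # one value per lag
```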
With the built semivariogram, you can explore its plot.
exp_semivar.plot(plot_covariance=False)
Semivariogram of the prediction error
The semivariance tends to grow across nearly the whole range, indicating that the autocorrelation does not disappear when considering larger regions. Moreover, it seems that there is a difference in performance across large parts of the country. In any case, the results clearly suggest that the model has trouble with the spatial heterogeneity of the relationship between the independent variables and the target one.
LISA on residuals
One approach to determining the spatial dependence of the residuals that you are already familiar with is measuring local indicators of spatial autocorrelation. The variogram does not really help in selecting the optimal neighborhood to work with, so let’s build a distance band graph with a threshold of 10 km.
The distance band builder also assumes point geometry.
Now, you have two options for how to assess local autocorrelation. When using the raw residuals, you can identify areas that are consistently over-predicted and those that are consistently under-predicted.
High-High clusters are those that are consistently under-predicted, while Low-Low clusters are those consistently over-predicted, based on the spatial distribution of residuals.
Another option is to assess the absolute value of residuals and identify clusters of consistently correct and consistently wrong locations.